[Quantization] add humming kernel support for deepseek v4 #24289
jinzhen-lin wants to merge 11 commits into
Conversation
Code Review
This pull request introduces the "Humming" quantization backend and MoE runner, adding optimized Triton and CUDA kernels for specialized quantization formats like MXFP4. The feedback highlights critical issues such as a potential memory leak in runner registration, possible out-of-bounds memory access in the Triton kernel, and problematic in-place configuration modifications. Additionally, the review suggests fixing a typo in attribute mapping, removing redundant rounding operations, and handling variable data type sizes more accurately during memory allocation.
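One of the flagged issues, the in-place configuration modification, can be illustrated with a minimal sketch (the function and dict keys below are hypothetical, not from the PR): when several layers share one quantization config object, per-layer overrides should be applied to a copy rather than mutating the shared instance.

```python
import copy

def apply_overrides(shared_config: dict, overrides: dict) -> dict:
    """Hypothetical helper: return a per-layer config without
    mutating the shared dict that other layers still reference."""
    cfg = copy.deepcopy(shared_config)  # never modify the caller's object
    cfg.update(overrides)
    return cfg

shared = {"fmt": "mxfp4", "group_size": 32}
layer_cfg = apply_overrides(shared, {"group_size": 64})
```

Mutating `shared` directly would silently change the config seen by every other layer that holds a reference to it.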
cc @Fridge003 for hopper w4a16 kernels
Hello, can DeepSeek-V4 Pro use the humming kernel?
It should be supported, but I haven't actually run it myself yet. You're welcome to try it and share feedback.
@jinzhen-lin Hi, here are fixes:
Fix applying the 2604B SwiGLU clamp/checker path jinzhen-lin#2
Fix the DeepEP empty-token path error jinzhen-lin#3
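The SwiGLU clamp fix referenced above (jinzhen-lin#2) targets a common failure mode in low-precision MoE kernels: an unbounded gate activation overflows the narrow intermediate dtype. A minimal sketch of the idea, with an illustrative clamp limit that is not the value from the actual kernel:

```python
import math

def swiglu_clamped(gate: float, up: float, limit: float = 7.0) -> float:
    """Clamped SwiGLU sketch: bound the gate before SiLU so the
    low-precision intermediate cannot blow up. The 7.0 limit is
    illustrative only."""
    g = max(-limit, min(limit, gate))
    silu = g / (1.0 + math.exp(-g))  # SiLU(x) = x * sigmoid(x)
    return silu * up
```

Without the clamp, a large `gate` value would push `silu * up` outside the representable range of an fp16/fp8 intermediate, which is one plausible source of garbled output tokens.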
Bug: Garbled token insertion with Humming MXFP4 + DeepEP on H200

Environment:
Symptom: Occurs in both the thinking chain and the final content (independently, never both in the same request).
Trigger:
Ruled out:
Hypothesis:
@txh1873749380 Which specific commit are you using? Does it include commit ff72f25? I suspect it might be the SwiGLU clamp issue, but that should have been fixed already.
@jinzhen-lin Agreed. I suspect the SwiGLU clamp too. Working on a proper fix.
@jinzhen-lin I'm on a commit from before last week, so I likely don't have ff72f25. Checked the recent changes and they overlap almost exactly with what I'm working on. Still figuring out the right clamp fix. |
This PR adds Humming kernels to SGLang. It is based on #23754, adding and improving support for DeepSeek V4 on top of it.
Humming Kernels: https://github.com/inclusionAI/humming
vLLM supports:
Humming is a universal, high-performance quantization kernel (similar to the Marlin kernel), but offers several advantages over Marlin:
Benchmark
Service start command
Benchmark command
Benchmark result (TPS)
In SGLang, splitkv_mla and paged_mqa are used for the prefill part of DeepSeek V4, and the attention part takes longer than expected. If fixed, Humming is expected to achieve a greater end-to-end improvement.
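For readers unfamiliar with the MXFP4 format this PR targets, here is a minimal sketch of MXFP4 block quantization under the OCP Microscaling spec (E2M1 4-bit elements sharing one power-of-two scale per 32-value block). It illustrates the format only; it is not the Humming kernel code.

```python
import math

# E2M1 representable magnitudes (1 sign, 2 exponent, 1 mantissa bit)
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize one 32-value block: pick a shared power-of-two scale,
    then round each scaled value to the nearest E2M1 code."""
    assert len(block) == 32
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * 32
    # Shared scale: a power of two that maps amax near the top
    # E2M1 code (6.0 = 1.5 * 2**2, hence the "- 2").
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    quantized = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)                       # saturate
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to grid
        quantized.append(math.copysign(nearest, v))
    return scale, quantized

def dequantize_mxfp4_block(scale, quantized):
    return [scale * q for q in quantized]
```

A kernel like Humming fuses the dequantization (scale multiply plus E2M1 decode) into the GEMM itself rather than materializing the dequantized weights.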